886 results for leave one out cross validation


Relevance:

100.00%

Publisher:

Abstract:

A fundamental principle in practical nonlinear data modeling is the parsimonious principle of constructing the minimal model that explains the training data well. Leave-one-out (LOO) cross validation is often used to estimate generalization error when choosing among different network architectures (M. Stone, "Cross-validatory choice and assessment of statistical predictions", J. R. Statist. Soc., Ser. B, 36, pp. 111-147, 1974). Based on the minimization of an LOO criterion, namely the mean square of the LOO errors for regression and the LOO misclassification rate for classification, we present two backward elimination algorithms as model post-processing procedures for regression and classification problems. The proposed backward elimination procedures exploit an orthogonalization procedure to ensure orthogonality between the subspace spanned by the pruned model and the deleted regressor. It is then shown that the LOO criteria used in both algorithms can be calculated via analytic recursive formulae, derived in this contribution, without actually splitting the estimation data set, thereby reducing computational expense. Compared to most other model construction methods, the proposed algorithms are advantageous in several respects: (i) there are no tuning parameters to be optimized through an extra validation data set; (ii) the procedure is fully automatic, without an additional stopping criterion; and (iii) the model structure selection is based directly on model generalization performance. Illustrative examples on regression and classification demonstrate that the proposed algorithms are viable post-processing methods for pruning a model to gain extra sparsity and improved generalization.
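
The computational point that makes this practical, namely that the LOO errors of a linear-in-the-parameters model can be obtained in closed form without refitting, is the classical PRESS identity e_loo_i = e_i / (1 - h_ii). The sketch below illustrates that identity with plain numpy; it is not the paper's recursive orthogonal formulation, and the design matrix Phi and targets y are assumed given.

```python
import numpy as np

def loo_mse(Phi, y):
    """PRESS-style LOO mean squared error for a linear-in-the-parameters model
    y ~ Phi @ theta, computed from a single fit via e_loo_i = e_i / (1 - h_ii),
    where H = Phi (Phi^T Phi)^-1 Phi^T is the hat matrix."""
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ theta
    G = np.linalg.pinv(Phi.T @ Phi)
    h = np.einsum('ij,jk,ik->i', Phi, G, Phi)   # diagonal of the hat matrix
    loo_resid = resid / (1.0 - h)
    return np.mean(loo_resid ** 2)
```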

Relevance:

100.00%

Publisher:

Abstract:

We develop an orthogonal forward selection (OFS) approach to construct radial basis function (RBF) network classifiers for two-class problems. Our approach integrates several concepts in probabilistic modelling, including cross validation, mutual information and Bayesian hyperparameter fitting. At each stage of the OFS procedure, one model term is selected by maximising the leave-one-out mutual information (LOOMI) between the classifier's predicted class labels and the true class labels. We derive the formula of LOOMI within the OFS framework so that the LOOMI can be evaluated efficiently for model term selection. Furthermore, a Bayesian procedure of hyperparameter fitting is integrated into each stage of the OFS to infer the l2-norm based local regularisation parameter from the data. Since each forward stage is effectively the fitting of a one-variable model, this task is very fast. The classifier construction procedure terminates automatically, without the need for an additional stopping criterion, and yields very sparse RBF classifiers with excellent classification generalisation performance, which is particularly useful for noisy data sets with highly overlapping class distributions. A number of benchmark examples are employed to demonstrate the effectiveness of our proposed approach.
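
For readers unfamiliar with a mutual-information criterion between label vectors, the sketch below shows the plain empirical estimate from a joint frequency table. The paper's LOOMI applies this idea to leave-one-out predicted labels and evaluates it through recursive formulae inside the OFS loop, which this minimal version does not reproduce.

```python
import numpy as np

def label_mutual_information(y_true, y_pred):
    """Empirical mutual information (in nats) between two label vectors,
    estimated from their joint frequency table."""
    mi = 0.0
    for ct in np.unique(y_true):
        for cp in np.unique(y_pred):
            p_joint = np.mean((y_true == ct) & (y_pred == cp))
            if p_joint > 0:
                p_t = np.mean(y_true == ct)
                p_p = np.mean(y_pred == cp)
                mi += p_joint * np.log(p_joint / (p_t * p_p))
    return mi
```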

Relevance:

100.00%

Publisher:

Abstract:

Given n training examples, training a Kernel Fisher Discriminant (KFD) classifier corresponds to solving a linear system of dimension n. In cross-validating KFD, the training examples are split into two distinct subsets L times, where a subset of m examples is used for validation and the other subset of (n - m) examples is used for training the classifier. In this case, L linear systems of dimension (n - m) need to be solved. We propose a novel method for cross-validation of KFD in which, instead of solving L linear systems of dimension (n - m), we compute the inverse of an n × n matrix and solve L linear systems of dimension 2m, thereby reducing the complexity when L is large and/or m is small. For typical 10-fold and leave-one-out cross-validations, the proposed algorithm is approximately 4 and (4/9)n times, respectively, as efficient as the naive implementations. Simulations are provided to demonstrate the efficiency of the proposed algorithms.
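
The "invert once instead of retraining L times" idea has a particularly simple leave-one-out form for regularized kernel systems. The sketch below shows it for kernel ridge regression on +/-1 labels, a close relative of KFD without the bias term; it illustrates the principle rather than the paper's exact 2m-dimensional update.

```python
import numpy as np

def kernel_ridge_loo_predictions(K, y, lam):
    """Leave-one-out predictions for kernel ridge regression with kernel matrix K,
    targets y and ridge parameter lam, using a single n x n inverse:
    the LOO residual is alpha_i / C_ii, where C = (K + lam*I)^-1 and alpha = C y."""
    n = K.shape[0]
    C = np.linalg.inv(K + lam * np.eye(n))
    alpha = C @ y
    loo_residual = alpha / np.diag(C)
    return y - loo_residual          # LOO prediction for each held-out example
```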

Relevance:

100.00%

Publisher:

Abstract:

This paper proposes a sparse modeling approach to solving ordinal regression problems using Gaussian processes (GP). Designing a sparse GP model is important from both the training-time and inference-time viewpoints. We first propose a variant of the Gaussian process ordinal regression (GPOR) approach, leave-one-out GPOR (LOO-GPOR), which performs model selection using the leave-one-out cross-validation (LOO-CV) technique. We then provide an approach to designing a sparse model for GPOR. The sparse GPOR model reduces computational time and storage requirements and provides faster inference. We compare the proposed approaches with the state-of-the-art GPOR approach on some benchmark data sets. Experimental results show that the proposed approaches are competitive.
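
For standard GP regression the LOO-CV predictive distribution has a well-known closed form that avoids n refits (Rasmussen and Williams, ch. 5); LOO-GPOR applies LOO-CV in the ordinal setting, which this regression-only sketch does not cover. The kernel matrix K, targets y and noise variance are assumed given.

```python
import numpy as np

def gp_loo_cv(K, y, noise_var):
    """Closed-form leave-one-out predictive means and variances for GP regression:
    mu_i = y_i - [Kinv y]_i / [Kinv]_ii and var_i = 1 / [Kinv]_ii,
    where Kinv is the inverse of the noisy kernel matrix."""
    Kinv = np.linalg.inv(K + noise_var * np.eye(len(y)))
    alpha = Kinv @ y
    d = np.diag(Kinv)
    return y - alpha / d, 1.0 / d
```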

Relevance:

100.00%

Publisher:

Abstract:

This paper presents an efficient construction algorithm for obtaining sparse kernel density estimates based on a regression approach that directly optimizes model generalization capability. Computational efficiency of the density construction is ensured using an orthogonal forward regression, and the algorithm incrementally minimizes the leave-one-out test score. A local regularization method is incorporated naturally into the density construction process to further enforce sparsity. An additional advantage of the proposed algorithm is that it is fully automatic: the user is not required to specify any criterion to terminate the density construction procedure. This is in contrast to an existing state-of-the-art kernel density estimation method based on the support vector machine (SVM), where the user is required to specify a critical algorithm parameter. Several examples are included to demonstrate the ability of the proposed algorithm to effectively construct a very sparse kernel density estimate with accuracy comparable to that of the full-sample optimized Parzen window density estimate. Our experimental results also demonstrate that the proposed algorithm compares favorably with the SVM method, in terms of both test accuracy and sparsity, for constructing kernel density estimates.
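
The kind of leave-one-out test score involved can be illustrated with the plain Parzen window estimator: score each point under the density built from the remaining points. This one-dimensional sketch only evaluates such a score for a given bandwidth h; the paper's algorithm additionally selects a sparse kernel subset by orthogonal forward regression with local regularization.

```python
import numpy as np

def parzen_loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian Parzen window estimate
    with bandwidth h: each sample is scored under the density formed by the
    other n - 1 samples."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    Kmat = np.exp(-0.5 * (diff / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    np.fill_diagonal(Kmat, 0.0)                  # leave each point out
    loo_density = Kmat.sum(axis=1) / (n - 1)
    return np.sum(np.log(loo_density + 1e-300))
```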

Relevance:

100.00%

Publisher:

Abstract:

We propose a simple yet computationally efficient construction algorithm for two-class kernel classifiers. In order to optimise the classifier's generalisation capability, an orthogonal forward selection procedure is used to select kernels one by one by directly minimising the leave-one-out (LOO) misclassification rate. It is shown that the computation of the LOO misclassification rate is very efficient owing to orthogonalisation. Examples are used to demonstrate that the proposed algorithm is a viable alternative for constructing sparse two-class kernel classifiers in terms of performance and computational efficiency.
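
A minimal sketch of how an LOO misclassification rate can be computed without any explicit data splitting is given below, for a regularized least-squares classifier with +/-1 labels and a design matrix Phi whose columns are the selected kernels. The paper evaluates the same quantity through orthogonalised recursive updates, which are not reproduced here.

```python
import numpy as np

def loo_misclassification_rate(Phi, y, lam=1e-6):
    """LOO misclassification rate of a regularized least-squares classifier with
    +/-1 labels, using the ridge LOO identity f_loo_i = (f_i - h_ii*y_i)/(1 - h_ii)
    so that no retraining is required."""
    G = np.linalg.inv(Phi.T @ Phi + lam * np.eye(Phi.shape[1]))
    theta = G @ Phi.T @ y
    f = Phi @ theta
    h = np.einsum('ij,jk,ik->i', Phi, G, Phi)    # leverage of each sample
    f_loo = (f - h * y) / (1.0 - h)
    return np.mean(np.sign(f_loo) != y)
```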

Relevance:

100.00%

Publisher:

Abstract:

We propose a simple and computationally efficient construction algorithm for two-class linear-in-the-parameters classifiers. In order to optimize model generalization, an orthogonal forward selection (OFS) procedure is used to directly minimize the leave-one-out (LOO) misclassification rate. An analytic formula and a set of forward recursive updating formulae for the LOO misclassification rate are developed and applied in the proposed algorithm. Numerical examples are used to demonstrate that the proposed algorithm is an excellent alternative approach for constructing sparse two-class classifiers in terms of performance and computational efficiency.
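
To show how such a criterion drives both selection and automatic termination, here is a generic greedy forward-selection loop: at each step the candidate column with the lowest LOO score is added, and the loop stops as soon as no candidate improves the score. The `loo_score` argument stands for any analytic LOO criterion (for instance the LOO misclassification rate sketched earlier); the paper's OFS additionally uses orthogonalised recursive updates for efficiency, which this plain sketch does not.

```python
import numpy as np

def forward_select_by_loo(candidates, y, loo_score):
    """Greedy forward selection of columns from `candidates` (a list of 1-D arrays)
    driven by an analytic LOO criterion `loo_score(Phi, y)`; terminates automatically
    when no remaining candidate lowers the score."""
    selected, Phi, best = [], np.empty((len(y), 0)), np.inf
    while True:
        scores = [(loo_score(np.column_stack([Phi, c]), y), k)
                  for k, c in enumerate(candidates) if k not in selected]
        if not scores:
            break
        score, k = min(scores)
        if score >= best:                        # automatic stopping criterion
            break
        best, Phi = score, np.column_stack([Phi, candidates[k]])
        selected.append(k)
    return selected, best
```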

Relevance:

100.00%

Publisher:

Abstract:

This paper describes informatics for cross-sample analysis with comprehensive two-dimensional gas chromatography (GCxGC) and high-resolution mass spectrometry (HRMS). GCxGC-HRMS analysis produces large data sets that are rich with information but highly complex. The size of the data and the volume of information require automated processing for comprehensive cross-sample analysis, but the complexity poses a challenge for developing robust methods. The approach developed here analyzes GCxGC-HRMS data from multiple samples to extract a feature template that comprehensively captures the pattern of peaks detected in the retention-time plane. Then, for each sample chromatogram, the template is geometrically transformed to align with the detected peak pattern and generate a set of feature measurements for cross-sample analyses such as sample classification and biomarker discovery. The approach avoids the intractable problem of comprehensive peak matching by using a few reliable peaks for alignment and peak-based retention-plane windows to define comprehensive features that can be reliably matched for cross-sample analysis. The informatics approach is demonstrated with a set of 18 samples from breast-cancer tumors, each from a different individual, six each for Grades 1-3. The features allow classification that matches grading by a cancer pathologist with 78% success in leave-one-out cross-validation experiments. The HRMS signatures of the features of interest can be examined to determine elemental compositions and identify compounds.
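
The leave-one-out evaluation reported above can be reproduced in outline with any off-the-shelf classifier once template features are available. The sketch below uses scikit-learn with toy stand-in data (18 samples, three grades, a hypothetical feature matrix); the actual classifier and feature set used in the study are not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(18, 50))        # toy stand-in for the template feature measurements
y = np.repeat([1, 2, 3], 6)          # six samples per tumour grade

# Leave-one-out accuracy of a simple classifier on the feature table
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2f}")
```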

Relevance:

100.00%

Publisher:

Abstract:

In this chapter, we elaborate on the well-known relationship between Gaussian processes (GP) and Support Vector Machines (SVM). We then present approximate solutions for two computational problems arising in GP and SVM. The first is the calculation of the posterior mean for GP classifiers using a 'naive' mean field approach. The second is a leave-one-out estimator for the generalization error of SVM based on a linear response method. Simulation results on a benchmark dataset show similar performance for the GP mean field algorithm and the SVM algorithm. The approximate leave-one-out estimator is found to be in very good agreement with the exact leave-one-out error.
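
As a point of reference for LOO estimators of SVM generalization error, the sketch below compares the exact leave-one-out error of an SVM with the classical support-vector-count bound, LOO error <= n_SV / n, on synthetic data. The chapter's linear-response estimator is a sharper approximation than this bound and is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# Exact LOO error by retraining n times (feasible for small n)
exact_loo_error = 1.0 - cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()

# Classical bound: the LOO error is at most the fraction of support vectors
sv_bound = svm.fit(X, y).n_support_.sum() / len(y)
print(f"exact LOO error: {exact_loo_error:.3f}, support-vector bound: {sv_bound:.3f}")
```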

Relevance:

100.00%

Publisher:

Abstract:

Background: Predicting protein subnuclear localization is a challenging problem. Previous approaches based on non-sequence information, including Gene Ontology annotations and kernel fusion, have their respective limitations. The aim of this work is twofold: one is to propose a novel individual feature extraction method; the other is to develop an ensemble method to improve prediction performance using comprehensive information represented as a high-dimensional feature vector obtained from 11 feature extraction methods. Methodology/Principal Findings: A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It only considers those feature extraction methods based on amino acid classifications and physicochemical properties. In order to speed up our system, an automatic search method for the kernel parameter is used. The prediction performance of our method is evaluated on four datasets: the Lei dataset, a multi-localization dataset, the SNL9 dataset and a new independent dataset. The overall accuracy of prediction for 6 localizations on the Lei dataset is 75.2% and that for 9 localizations on the SNL9 dataset is 72.1% in leave-one-out cross validation; the corresponding figures are 71.7% for the multi-localization dataset and 69.8% for the new independent dataset. Comparisons with existing methods show that our method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of our classification model are further confirmed by permutation analysis. Conclusions: Our method is effective and valuable for predicting protein subnuclear localizations. A web server implementing the proposed method is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.
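
The automatic kernel-parameter search mentioned above can be approximated with a standard cross-validated grid search over the RBF parameters of a multiclass SVM, as in the scikit-learn sketch below; the feature vectors X, the localization labels y and the parameter grid are all stand-ins rather than the study's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))        # toy stand-in for the extracted feature vectors
y = np.repeat(np.arange(6), 10)      # six hypothetical subnuclear localization classes

# Inner cross-validation chooses the RBF kernel parameter gamma and the penalty C
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
```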

Relevance:

100.00%

Publisher:

Abstract:

Early detection, clinical management and disease recurrence monitoring are critical areas in cancer treatment in which specific biomarker panels are likely to be very important. We have previously demonstrated that levels of alpha-2-Heremans-Schmid glycoprotein (AHSG), complement component C3 (C3), clusterin (CLI), haptoglobin (HP) and serum amyloid A (SAA) are significantly altered in serum from patients with squamous cell carcinoma of the lung. Here, we report the abundance levels of these proteins in serum samples from patients with advanced breast cancer, colorectal cancer (CRC) and lung cancer compared to healthy controls (age and gender matched) using commercially available enzyme-linked immunosorbent assay kits. Logistic regression (LR) models were fitted to the resulting data, and the classification ability of the proteins was evaluated using receiver operating characteristic (ROC) curves and leave-one-out cross-validation (LOOCV). The most accurate individual candidate biomarkers were C3 for breast cancer [area under the curve (AUC) = 0.89, LOOCV = 73%], CLI for CRC (AUC = 0.98, LOOCV = 90%), HP for small cell lung carcinoma (AUC = 0.97, LOOCV = 88%), C3 for lung adenocarcinoma (AUC = 0.94, LOOCV = 89%) and HP for squamous cell carcinoma of the lung (AUC = 0.94, LOOCV = 87%). The best dual combinations of biomarkers from the LR analysis were found to be AHSG + C3 (AUC = 0.91, LOOCV = 83%) for breast cancer, CLI + HP (AUC = 0.98, LOOCV = 92%) for CRC, C3 + SAA (AUC = 0.97, LOOCV = 91%) for small cell lung carcinoma and HP + SAA for both adenocarcinoma (AUC = 0.98, LOOCV = 96%) and squamous cell carcinoma of the lung (AUC = 0.98, LOOCV = 84%). The high AUC values reported here indicate that these candidate biomarkers have the potential to discriminate accurately between control and cancer groups, both individually and in combination with other proteins. Copyright © 2011 UICC.
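
The ROC/LOOCV evaluation of a one- or two-protein logistic regression panel follows a standard pattern; the sketch below shows it with scikit-learn on simulated measurements (two hypothetical protein levels, binary control/cancer labels), not the study's actual data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))                                              # e.g. two protein levels per sample
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.5, size=80) > 0).astype(int)  # 0 = control, 1 = cancer

model = LogisticRegression()
auc = roc_auc_score(y, model.fit(X, y).decision_function(X))   # in-sample ROC AUC of the panel
loo_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())    # leave-one-out class predictions
print(f"AUC = {auc:.2f}, LOOCV accuracy = {np.mean(loo_pred == y):.2f}")
```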

Relevance:

100.00%

Publisher:

Abstract:

OBJECTIVE: This study explored gene expression differences that predict response to chemoradiotherapy in esophageal cancer. PURPOSE: A major pathological response to neoadjuvant chemoradiation is observed in about 40% of esophageal cancer patients and is associated with favorable outcomes. However, patients with tumors of similar histology, differentiation and stage can have vastly different responses to the same neoadjuvant therapy. This dichotomy may be due to differences in the molecular genetic environment of the tumor cells. BACKGROUND DATA: Diagnostic biopsies were obtained from a training cohort of 13 esophageal cancer patients, and extracted RNA was hybridized to genome expression microarrays. The resulting gene expression data were verified by qRT-PCR. In a larger, independent validation cohort of 27 patients, we examined differential gene expression by qRT-PCR. The ability of differentially regulated genes to predict response to therapy was assessed in a multivariate leave-one-out cross-validation model. RESULTS: Although 411 genes were differentially expressed between normal and tumor tissue, only 103 genes were altered between responder and non-responder tumors, and 67 genes were differentially expressed more than 2-fold. These included genes previously reported in esophageal cancer and a number of novel genes. In the validation cohort, 8 of 12 selected genes were significantly different between the response groups. In the predictive model, 5 of 8 genes could predict response to therapy with 95% accuracy in a subset (74%) of patients. CONCLUSIONS: This study has identified a gene microarray pattern and a set of genes associated with response to neoadjuvant chemoradiation in esophageal cancer. The potential of these genes as biomarkers of response to treatment warrants further investigation. Copyright © 2009 by Lippincott Williams & Wilkins.